There is by now a wealth of evidence that perception results in gradient behaviour; see McMurray (2022) for a recent review. When we look at perceptual judgements, say in an identification task, we see a (sigmoidal) cline that moves from judgements in favour of one category to judgements in favour of the other. The question at hand today is: can researchers infer gradient representations from this sort of behavioural output?
A common whipping boy in such discussions is the theory of Categorical Perception (Liberman et al. 1957), which in its extreme (and likely uncharitable1) form makes a set of claims:
Extreme version of the theory of Categorical Perception
Claim1: Input speech automatically activates the speech perception system.
Claim2: The input is perceived as a unique linguistically relevant category.
Claim3: Since only categories are available, the perceptual output is a discrete/step function.
To put it together: when speech is perceived, the speech perceptual system is automatically activated (and general audition is, I guess, deactivated), and it is in that domain that an input is perceived as a unique linguistically relevant category. And since only categories are available, the associated perceptual output should be a discrete/step function. I am not actually sure that the original authors meant for such categorical perception to affect general auditory processing; in fact, given that they claimed that input speech automatically triggers the speech perception system, I don't quite see how they could have made that claim. However, this is how the theory has been interpreted in the literature, and for the sake of this blog post, I will go along with the same extreme position.
Of course, we can't infer that there is gradience in the perceptual system just because there is gradience in the observed average perceptual behaviour. Massaro and Cohen (1983) correctly observed that a trivial sort of observed gradience can arise as an averaging artefact. If different subjects have slightly different categorical boundaries, then it is easy to obtain gradience in the value averaged across all participants simply by averaging over the subjects' multiple discrete categorisations.
Here is a simple piece of code to show that. Imagine that we are looking at Voice Onset Time (VOT), where the average categorical boundary is about 25-30 ms. I will just say it is 30 ms, and further assume that the standard deviation across subjects is about 10 ms. Furthermore, let's assume that the categorical boundaries are normally distributed across subjects,2 and that each subject in fact has a categorical percept of an input based on their categorical boundary. Finally, let's look at an input VOT value of 30 ms. As can be seen in the output of the function, the actual response for each subject is categorical (either Cat1 or Cat2).
library(tidyverse)

categoricalPerceiver = function(input, boundary, boundarySD, reps){
  #Sampling a categorical boundary for each repetition/subject
  CategoricalBoundaries = rnorm(reps, mean = boundary,
                                sd = boundarySD)
  #Creating a data.frame with repetition information
  Results = data.frame(Reps = 1:reps)
  #Giving each repetition/subject a strictly categorical percept
  #based on their sampled boundary
  Results %>%
    mutate(Percept = ifelse(input < CategoricalBoundaries, "Cat1", "Cat2"))
}
#Output of the categorisation event for each subject
categoricalPerceiver(30, 30, 10, 5)
Reps Percept
1 1 Cat1
2 2 Cat1
3 3 Cat1
4 4 Cat1
5 5 Cat1
OK, now, using the above function, we can create a new function, just for convenience, that will generate the categorical judgements for multiple subjects for a range of VOT values. Furthermore, in order to convert these categories to numbers and get a proportion, we will code one of the categories as 0 and the other as 1 (low VOT, voiced = 0; high VOT, voiceless = 1). As can be seen in the plot of averaged responses over all the participants, there appears to be a cline even though each participant had a categorical percept for each value of the VOT continuum.
#Function generating averaged judgements over a range of input values
categoricalPerceiverForInputRange = function(input, boundary, boundarySD, reps){
  #Note: the range below overrides the input argument
  inputRange = seq(0, 60, by = 3)
  data.frame(input = inputRange) %>%
    mutate(inputCopy = input) %>%
    group_by(input) %>%
    nest() %>%
    mutate(Percepts = map(data, function(.x){
      categoricalPerceiver(input = .x$inputCopy,
                           boundary,
                           boundarySD,
                           reps)})) %>%
    select(-data) %>%
    unnest(Percepts) %>%
    mutate(PerceptBinary = ifelse(Percept == "Cat1", 0, 1)) %>%
    group_by(input) %>%
    summarise(meanPercept = mean(PerceptBinary))
}
#Plot of averaged values
categoricalPerceiverForInputRange(input = 30,
                                  boundary = 30,
                                  boundarySD = 5,
                                  reps = 100) %>%
  ggplot(aes(x = input, y = meanPercept)) +
  geom_point() +
  geom_line() +
  xlab("Input value") +
  ylab("Proportion of Cat2 percept")
To avoid this across-subject averaging artefact, Massaro and Cohen (1983) asked their subjects to rate how /b/-like or /p/-like a given input stimulus was. There is an issue here: if a subject really perceives categories as discrete, then asking them such a question forces them to re-interpret it as a question about something else. As Armstrong, Gleitman, and Gleitman (1983) point out, if you give participants a task of rating "how good an exemplar" a number is of primeness/evenness/oddness, they will gladly reinterpret the question to mean something else and respond with gradient responses, despite the fact that all three concepts are clearly and necessarily categorical. We will set this issue aside for now, be charitable to the researchers, and say that the task is tapping into something real.
Massaro and Cohen (1983) reasoned that if the subjects were hearing the stimuli categorically, and if the observed gradience is just an averaging artefact over multiple subjects, then when we look at single subjects, we shouldn't see such (smooth) clines; instead, we should see a step function for each subject that switches at that subject's categorical boundary. They argue that, in fact, even when each subject is looked at separately, smooth clines are observed. To be honest, when I look at their results, I don't see the smooth cline they mention, but their argument comes from comparing the statistical fits of the two models (a categorical percept model and a gradient percept model) and showing that the latter fits the data better. OK, again, in an effort to be charitable, let's grant that single subjects don't show step functions.
Based on such facts, people have argued for: (a) gradient category activation, a term that many psychologists love, and many linguists don’t understand; (b) gradient representations, a term that is actually not formally defined in such work; (c) both (a-b), essentially because they are used interchangeably; (d) a probabilistic percept, where an inverse probability is attached to a discrete category.
Out of these, the first three could be argued to be incompatible with categorical representations (although, what is an "activation"?), but option (d) is straightforwardly compatible with categorical representations. Even with option (d), though, one might complain that attaching probabilities to a discrete category means that the inferred representation must include some sort of gradient/numerical information along with the categorical information, and that therefore, on its face, it is an argument against purely categorical representations.
However, that way of thinking about probabilistic percepts simply confuses a computational-level description with an algorithmic one, in the sense of Marr (1982). The simple point is that just because numerical descriptions attached to representations are useful at the computational level, it doesn't follow that the underlying algorithm actually uses gradient representations.
So, one can ask: does the observation of within-speaker gradience in perception spell doom for any view that espouses categorical percepts? It is clear that the extreme version of the theory of Categorical Perception laid out above will have to be sent off to Hades. However, that viewpoint bundles three different claims together, and observing that some set of facts contradicts a view that espouses all three claims is not equivalent to showing that each of the three claims has been disproven. Essentially,
¬(Claim1 ∧ Claim2 ∧ Claim3) ≠ ¬Claim1 ∧ ¬Claim2 ∧ ¬Claim3
Equating the right hand side with the left hand side would be a violation of De Morgan's laws! The correct inference, if the theory of Categorical Perception is falsified, is that at least one of the claims is wrong, namely,
¬Claim1 ∨ ¬Claim2 ∨ ¬Claim3
Before we say that any of these claims is falsified, it is worth noting that the same averaging artefact discussed above is possible within subjects too, if each subject is presented with multiple repetitions of the same stimuli. While each subject has a mean categorical boundary, there is still some amount of noise/variance around that mean; so, when you average across the multiple tokens, you can get a cline using exactly the same code as above.
Now, one might cry bloody murder and push back, arguing that if we presented each stimulus just once to a subject and still observed a perceptual cline, that would show that perception is gradient. But that claim rides on the subject perceiving the input as a unique percept. If we give up on Claim2,3 then the data can be accounted for as an averaging artefact within a subject.
Say that a subject in fact hears the input and categorises it multiple times, not just once, and the final response is the average over those multiple categorisations. Then the above reasoning holds again, and we can get a perceptual cline even if each stimulus is presented only once. I will call this last view the Perception as Sampling view.
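This view can be sketched in a few lines of base R. Note that the function name and parameter values here are illustrative choices of mine, not code from elsewhere in this post:

```r
# A toy sketch of the Perception as Sampling view: for a single
# presentation, the subject covertly categorises the input several
# times, each time against a noisy decision boundary, and the overt
# response is the average of those categorical outcomes.
perceptionAsSampling = function(input, boundary = 30, boundarySD = 10,
                                nSamples = 10){
  # One noisy boundary per covert sample
  sampledBoundaries = rnorm(nSamples, mean = boundary, sd = boundarySD)
  # Each covert percept is strictly categorical: 0 (Cat1) or 1 (Cat2)
  covertPercepts = ifelse(input < sampledBoundaries, 0, 1)
  # The overt response is gradient only because it averages over
  # multiple categorical percepts
  mean(covertPercepts)
}

# Even with a single presentation per stimulus, the responses form a cline
sapply(seq(0, 60, by = 10), perceptionAsSampling)
```

Each covert percept is a bare category label; the gradience lives entirely in the averaging step.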
The thing to note is that, above, I bent over backwards to accept the claims (experiment design, results, …) of those who espouse gradient perceptual representations, though each of those claims could/should be questioned on its own. And still, it is clear that there is no tension between the observation of gradience and categorical perceptual representations. Once we allow a theory that sees perception as sampling, we can happily maintain that the representations involved are categorical and account for all the data that I am aware of, in fact in a perspicuous way. It also leads to interesting consequences that have often been ignored in the literature.
The central issue pointed out above is that the arguments against categorical percepts are really arguments against the extreme version of the theory of Categorical Perception (which involves a set of claims). The arguments don't extend to all theories that espouse categorical representations. This sort of error has been made repeatedly in the literature arguing against categorical representations; see Du and Durvasula (2025) for more examples and elaboration.
The Perception as Sampling view has an interesting consequence. As observed in previous work, some experimental paradigms (identification, AX, or ABX paradigms) result in steeper curves, while others, such as three-stimulus oddball paradigms4 or the 4IAX paradigm5, show much, much shallower clines.
Many have argued that this means that the identification, AX, or ABX paradigms are not appropriate, and that oddball or 4IAX tasks are more appropriate. The reasoning is fleshed out in terms of task complexity. For example, McMurray (2022) states that the former tasks are "more memory intensive, and may be confounded by different response criteria on the part of subjects" than the latter set of tasks.
However, I don't for the life of me understand how a task involving just one or two stimuli is somehow more memory intensive than a task with three or more stimuli. To do the oddball task, the participant necessarily has to hold all three stimuli in memory to decide; and to do the 4IAX task, the participant has to compare the first pair or the latter pair and then decide, so along with attending to a pair of sounds, they need to store a temporary decision too. In both cases, the tasks are at least as memory intensive as the former tasks. So, my understanding of the perceptual facts is that the perceptual cline becomes less steep with a more memory-intensive task, and not the other way around, as has been claimed in the literature.6
Given the Perception as Sampling theory I laid out above, we can actually understand this effect of memory on the perceptual function. Task complexity (or memory load) can be cashed out as more variability in the output due to limited resources. If there is more of a memory load, the decision is likely to be less accurate, as memory resources are likely used for other things. In mathematical terms, this means that the variance associated with the decision boundary effectively increases. Assuming the sources of such complexity are independent of the categories or category values, the effect of the additional sources of variability is going to be additive. Consequently, the variance (and therefore the standard deviation) of the categorical boundary will be higher.
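In symbols, with B the subject's categorical boundary and ε_task the task-related noise (the labels are mine; this just restates the independence assumption above):

```latex
\operatorname{Var}(B + \varepsilon_{\text{task}})
  = \sigma^{2}_{\text{boundary}} + \sigma^{2}_{\text{task}}
  \qquad \text{when } B \text{ and } \varepsilon_{\text{task}} \text{ are independent}
```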
Let's say that the standard deviation of the categorical boundary increases by 15 units in a high-memory-load paradigm compared to a low-memory-load paradigm; then the perceptual function is going to be less steep for the former, as in the figure below.
#Creating a data.frame with input values for each memory load condition
LowMemoryLoad = categoricalPerceiverForInputRange(input = 30,
                                                  boundary = 30,
                                                  boundarySD = 5,
                                                  reps = 100) %>%
  mutate(LoadType = "Low memory load")
HighMemoryLoad = categoricalPerceiverForInputRange(input = 30,
                                                   boundary = 30,
                                                   boundarySD = 20,
                                                   reps = 100) %>%
  mutate(LoadType = "High memory load")
#Plot of averaged values
rbind(LowMemoryLoad, HighMemoryLoad) %>%
  ggplot(aes(x = input, y = meanPercept, colour = LoadType)) +
  geom_point() +
  geom_line() +
  xlab("Input value") +
  ylab("Proportion of Cat2 percept")
In the previous section, I discussed how within-subject and within-stimulus gradience can still be cashed out as an averaging artefact through multiple sampling. There are at least two algorithmic implementations of this idea that can both account for within-single-stimulus gradience: (1) the sampling happens in parallel; (2) the repeated sampling happens serially until some threshold is reached. In the previous section, I tacitly went down the first route, but in what follows, I will try to show that the second, serial way to implement the algorithm has interesting/useful virtues.
The first virtue is that it allows us to better understand the memory load effect mentioned above. Note, I somewhat hand-waved earlier and said that task complexity leads to more variance. But if we see the sampling procedure as serial, we can actually derive it. When the task is more complex and there are more things that the subject needs to attend to, the working memory window for the perceptual task is likely to be smaller. Therefore, if perception is serial, there are simply fewer samples taken from the input; and when there are fewer samples, there is likely to be more variability in the perceived event.7
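This is just the familiar relationship between sample size and the width of a sampling distribution (footnote 7). A quick base-R check, with parameter values of my own choosing:

```r
# The standard deviation of an average over n samples scales as
# sigma/sqrt(n): fewer samples per percept means a noisier percept.
set.seed(1)
sampledPerceptSD = function(nSamplesPerPercept, nPercepts = 10000,
                            mu = 30, sigma = 10){
  # Each simulated percept is the mean of nSamplesPerPercept noisy samples
  percepts = replicate(nPercepts,
                       mean(rnorm(nSamplesPerPercept, mean = mu, sd = sigma)))
  sd(percepts)
}

# A small working-memory window (few samples) gives a wider spread
sampledPerceptSD(3)   # close to 10/sqrt(3), i.e., about 5.8
sampledPerceptSD(12)  # close to 10/sqrt(12), i.e., about 2.9
```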
A second virtue of the serial sampling model relates to understanding reaction times (RTs) in perception experiments. An important aspect of perception is that auditory proximity/confusability between two sounds leads to longer RTs in judging them to be same/different. For example, in an AX task, two sounds close to the categorical boundary will elicit longer RTs, while those away from the categorical boundary will elicit shorter RTs. Interestingly, if the sounds are identical, or if they are quite different, then the RTs are much faster. So, generally, one can't say that shorter auditory distances lead to longer RTs, because that would imply that identical stimuli have the longest RTs. Furthermore, though this heuristic is often used in perceptual testing, it is rarely justified or derived. In fact, if there are truly gradient representations, it is not clear to me why the RTs should be longer for more auditorily similar stimuli; nor is it clear how we even establish identity if the representations are truly gradient.
Here, the serial sampling model allows us to make progress on actually deriving the effect of auditory similarity on RTs. In this implementation, the sampling happens in a serial fashion, and the sampling stops when the algorithm converges on a certain categorical percept. Here, convergence means that additional repeated sampling doesn't result in substantial change to the proportion of each category inferred from the input (the proportional threshold is defined as 0.01 for the purposes of the code below, but the reasoning holds for other thresholds).
The code for the serial sampling procedure is given below. It essentially samples the input multiple times, and then decides on a categorical percept based on the steady-state proportions achieved during the repeated sampling. We will use the number of samples needed for convergence as our proxy for RTs.
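A minimal base-R sketch of such a serial sampler, matching the function name and arguments used in the plotting code below; the specific stopping rule (the running proportion must change by less than the threshold for 5 consecutive samples) is my assumption:

```r
# Sketch of a serial sampler: keep sampling the input against a noisy
# categorical boundary until the proportion of Cat2 percepts stabilises,
# and return the number of samples needed (the proxy for RT).
RepeatedSamplerTillConvergence = function(input, threshold, boundary,
                                          boundarySD, reps){
  oneConvergenceRun = function(currentInput){
    percepts = c()
    previousProp = NA
    stableSteps = 0
    # Convergence rule (an assumption): the running proportion changes
    # by less than the threshold for 5 consecutive samples
    while(stableSteps < 5){
      sampledBoundary = rnorm(1, mean = boundary, sd = boundarySD)
      percepts = c(percepts, ifelse(currentInput < sampledBoundary, 0, 1))
      currentProp = mean(percepts)
      if(!is.na(previousProp) &&
         abs(currentProp - previousProp) < threshold){
        stableSteps = stableSteps + 1
      } else {
        stableSteps = 0
      }
      previousProp = currentProp
    }
    length(percepts)
  }
  # Average the sample counts over reps presentations of each input value
  sapply(input, function(currentInput){
    mean(replicate(reps, oneConvergenceRun(currentInput)))
  })
}
```

Near the boundary the running proportion keeps fluctuating, so convergence takes many samples; far from the boundary almost every sample agrees, so it converges quickly.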
We can now simulate the results of the serial sampling procedure. As can be seen below, the RTs are longest (i.e., the sample count is highest) close to the categorical boundary. And as we move away from the categorical boundary, the RTs get shorter.
# Plotting RTs (with total sample count as a proxy) across different input values
#Same input range as before
inputRange = seq(0, 60, by = 1)
#Plotting number of steps (proxy for RTs)
data.frame(input = inputRange,
           inferenceTime = RepeatedSamplerTillConvergence(input = inputRange,
                                                          threshold = 0.01,
                                                          boundary = 30,
                                                          boundarySD = 20,
                                                          reps = 1)) %>%
  ggplot(aes(input, inferenceTime)) +
  geom_point() +
  geom_smooth() +
  xlab("Input value") +
  ylab("Number of samples for convergence (in lieu of RTs)")
To sum up, the serial sampling model of perception has some useful properties that make it interesting: it lends itself to a deeper understanding/derivation of the effect of memory load on perception, and it allows us to understand/derive the link between RTs and auditory distance.
There has been a lot of discussion of the need for gradient representations in the modern speech perception literature. However, the arguments often ride on falsifying a very naive (some might say extreme) interpretation of the theory of Categorical Perception. The theory comes with multiple claims, and even if one were to falsify the conjunction of all the claims, it doesn't follow that each of the claims is falsified.8
One might ask: why does it matter that there exists a view of perception in terms of categorical representations that is consistent with the facts we know? There are two reasons. First, phonological/lexical representations as we currently understand them are categorical representations (see Du and Durvasula (2025) for extensive discussion); therefore, perception has to result in representations of the same type, otherwise we have an incommensurability problem. Perception in terms of categorical representations trivially achieves this desideratum.
Second, categorical perceptual representations can be thought of as strict subsets of gradient perceptual representations. Given that they form strict subsets, they form smaller hypothesis spaces. In fact, given the lack of gradience, the hypothesis space is, strictly speaking, finite, compared to the uncountably infinite space of gradient perceptual representations. So, gradient representations generally have much more flexibility than categorical representations in accounting for data. However, it is important to separate accounting for the data from explaining the data: a more complex (superset) representation will always account for more data simply because it has more flexibility, but what matters is how well it explains the data, or for that matter truly predicts unseen/untrained data (when severely tested). Therefore, while flexibility can be seen as a virtue in some domains, it is distinctly not a virtue in theory construction, where we want to understand the system. Furthermore, as has been extensively argued in the philosophy of science literature, simplicity is a virtue of a theory (Goodman 1967, 1943).9 Goodman further argues that simplicity is an important criterion one employs in distinguishing between achieving a true scientific statement and simple curve fitting. While, no doubt, one could easily arm this discussion with an opposing list of philosophers who disagree with Goodman's claims, there is a core aspect of the claim that I believe is worth retaining, particularly with respect to the nature of gradience in perceptual representations, or linguistics more generally. Furthermore, entertaining simpler, more parsimonious representations allows us to explore the infinite space of logically possible perceptual representations in a tractable fashion.
I suppose an alternative research strategy, to be perfectly consistent with seeing flexibility as a benefit, is one where one chooses the most expansive representations on the argument that they are the most flexible; however, such a representation is ill-defined given the space of possibilities.
As Spencer Caplan pointed out to me, the original authors suggested that the following might be an extreme interpretation of their results.↩︎
I refer to subjects as “repetitions” in the code so that it generalises across different issues, as you will soon see.↩︎
I don’t have much to say about Claim3.↩︎
Where the participant is presented with three stimuli and asked to identify the odd one.↩︎
Where a participant is presented with two pairs of sounds, where one pair is identical and the other is different. Then, the participant is asked which pair is the same.↩︎
I will acknowledge that our understanding of how complexity affects performance is quite limited, and very much debated. See Liu and Li (2012) for a recent review.↩︎
This is of course the relationship between the width of the sampling distribution and sample size in inferential statistics.↩︎
This is of course related to the famous “tacking paradox” that confirmationist logics have as discussed in the philosophy of science literature (Glymour 1980; Mayo 1996).↩︎
“Simplicity, in at least some respect, is a test of truth.” (Goodman 1963).↩︎
For attribution, please cite this work as
Durvasula (2025, June 2). Karthik Durvasula: Perception as sampling with categorical representations. Retrieved from https://karthikdurvasula.gitlab.io/posts/2025-06-02-Perception as repeated sampling/
BibTeX citation
@misc{durvasula2025perception,
  author = {Durvasula, Karthik},
  title = {Karthik Durvasula: Perception as sampling with categorical representations},
  url = {https://karthikdurvasula.gitlab.io/posts/2025-06-02-Perception as repeated sampling/},
  year = {2025}
}